
Fix cuDNN v9 build by replacing removed cuDNN v6 RNN API usage by cuDNN v8 RNN API and reenable RNN tests for CUDA EP #19419

Merged
12 commits merged into microsoft:main on Feb 23, 2024

Conversation

mtavenrath
Contributor

Description

Replace the deprecated cuDNN RNN API with the cuDNN v8 RNN API and re-enable the RNN tests for the CUDA EP.

Motivation and Context

The deprecated cuDNN RNN API might vanish soon, and all RNN tests for the current CUDA EP RNN implementation are disabled due to failures. With this change the deprecated API has been removed, and the updated implementation no longer fails the tests.
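For context, the migration moves from the removed v6-era entry points (cudnnSetRNNDescriptor_v6, cudnnRNNForwardInference/Training) to the v8-style API. A schematic outline of the new call sequence, simplified to pseudocode level (not compiled here; descriptor setup, error checking, and all buffer names are placeholders, with a GRU cell chosen purely as an example):

```cpp
// Schematic cuDNN v8 RNN forward path (placeholders, not a drop-in implementation):
cudnnRNNDescriptor_t rnn_desc;
cudnnCreateRNNDescriptor(&rnn_desc);
cudnnSetRNNDescriptor_v8(rnn_desc, CUDNN_RNN_ALGO_STANDARD, CUDNN_GRU,
                         CUDNN_RNN_DOUBLE_BIAS, CUDNN_UNIDIRECTIONAL,
                         CUDNN_LINEAR_INPUT, CUDNN_DATA_FLOAT, CUDNN_DATA_FLOAT,
                         CUDNN_DEFAULT_MATH, input_size, hidden_size,
                         /*projSize=*/hidden_size, num_layers, dropout_desc,
                         /*auxFlags=*/0);

// Workspace and reserve sizes are queried instead of computed manually.
size_t workspace_bytes = 0, reserve_bytes = 0;
cudnnGetRNNTempSpaceSizes(handle, rnn_desc, CUDNN_FWD_MODE_INFERENCE, x_desc,
                          &workspace_bytes, &reserve_bytes);

// Single forward entry point replacing cudnnRNNForwardInference/Training.
cudnnRNNForward(handle, rnn_desc, CUDNN_FWD_MODE_INFERENCE, dev_seq_lengths,
                x_desc, x, y_desc, y, h_desc, hx, hy, c_desc, cx, cy,
                weight_space_bytes, weight_space, workspace_bytes, workspace,
                reserve_bytes, reserve_space);
```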

@gedoensmax
Contributor

@hariharans29 I believe we talked about some deprecated APIs via mail. Markus took it on to fix this. A review and probably guidance on testing would be much appreciated.

@mtavenrath
Contributor Author

@hariharans29 Can you please trigger the CI again? I accidentally removed a single } during cleanup of my PR.

@mtavenrath
Contributor Author

cuDNN v9.0.0 was released today. It removes the deprecated APIs that this PR replaces, so the CUDA EP of onnxruntime will no longer compile without this PR.

@gedoensmax
Contributor

@pranavsharma for viz due to cuDNN 9 discussions.

@mtavenrath mtavenrath changed the title Replace deprecated cuDNN RNN APIs by new cuDNN v8 APIs and re-enable RNN tests which have been broken before. Fix cuDNN v9 build by replacing removed cuDNN v6 RNN API usage by cuDNN v8 RNN API and reenable RNN tests for CUDA EP Feb 8, 2024
@hariharans29
Member

/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline

@hariharans29
Member

/azp run Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,Windows x64 QNN CI Pipeline

@hariharans29
Member

/azp run Big Models

Azure Pipelines successfully started running 1 pipeline(s).

Azure Pipelines successfully started running 7 pipeline(s).

@mtavenrath
Contributor Author

I've pushed an update which fixes one RNN test, lintrunner issues, and Linux compile warnings. For some reason my Windows build doesn't show those warnings even with /W3.

@hariharans29
Member

/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline

@hariharans29
Member

/azp run Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,Windows x64 QNN CI Pipeline

@hariharans29
Member

/azp run Big Models

Azure Pipelines successfully started running 1 pipeline(s).

Azure Pipelines successfully started running 7 pipeline(s).

Azure Pipelines successfully started running 9 pipeline(s).

@mtavenrath
Contributor Author

The warning behavior on Windows is annoyingly different from the one on Linux. Unused-local-variable warnings (C4189) are supposed to be enabled with /W4, whereas the default for ORT is /W3. Even with /W4 (or #pragma warning(3: 4189)) I'm still not able to trigger this warning.
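As an aside, a minimal reproducer for C4189 looks like the following (illustrative only; whether it actually fires depends on the compiler version and optimization settings, and the compile commands in the comment are assumptions about a plain cl.exe invocation):

```cpp
// Minimal C4189 reproducer. Expected behavior (not verified on every toolchain):
//   cl /W4 repro.cpp          -> C4189 at its default level 4
//   cl /W3 /w34189 repro.cpp  -> C4189 promoted to level 3 via /w<l><nnnn>
// C4189 only fires for a local that is initialized but never read afterwards.
int main() {
  int unused = 42;  // initialized, never referenced
  return 0;
}
```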

@hariharans29
Member

/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline

@hariharans29
Member

/azp run Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,Windows x64 QNN CI Pipeline

@hariharans29
Member

/azp run Big Models

Azure Pipelines successfully started running 1 pipeline(s).

Azure Pipelines successfully started running 9 pipeline(s).

Azure Pipelines successfully started running 7 pipeline(s).

@hariharans29
Member

/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline

@mtavenrath
Contributor Author

I've found one (potentially random) failing test on Windows. The problem with this test is that I cannot reproduce it on my local system (recent driver, RTX 6000 Ada). Which cuDNN version is being used, what kind of GPU is installed in the test machine, and which driver version is in use?

1: [ OK ] GRUTest.ONNXRuntime_TestGRUOpGrowBatchSequenceLength (48 ms)
1: [ RUN ] GRUTest.ONNXRuntime_TestGRUOpGrowBatchSequenceLengthLinearBeforeReset
1: 2024-02-20 20:04:01.7835208 [E:onnxruntime:Default, cuda_call.cc:118 onnxruntime::CudaCall] CUDA failure 700: an illegal memory access was encountered ; GPU=0 ; hostname=b345d2c5c000000 ; file=D:\a_work\1\s\onnxruntime\core\providers\cuda\gpu_data_transfer.cc ; line=73 ; expr=cudaMemcpyAsync(dst_data, src_data, bytes, cudaMemcpyDeviceToHost, static_cast<cudaStream_t>(stream.GetHandle()));
1: 2024-02-20 20:04:01.7838589 [E:onnxruntime:Default, cuda_call.cc:118 onnxruntime::CudaCall] CUDA failure 700: an illegal memory access was encountered ; GPU=0 ; hostname=b345d2c5c000000 ; file=D:\a_work\1\s\onnxruntime\core\providers\cuda\cuda_execution_provider.cc ; line=412 ; expr=cudaStreamSynchronize(static_cast<cudaStream_t>(stream_));
1: D:\a_work\1\s\onnxruntime\test\providers\base_tester.cc(323): error: Expected equality of these values:
1: expect_result
1: Which is: 4-byte object <00-00 00-00>
1: ExpectResult::kExpectFailure
1: Which is: 4-byte object <01-00 00-00>
1: Run failed but expected success: CUDA failure 700: an illegal memory access was encountered ; GPU=0 ; hostname=b345d2c5c000000 ; file=D:\a_work\1\s\onnxruntime\core\providers\cuda\gpu_data_transfer.cc ; line=73 ; expr=cudaMemcpyAsync(dst_data, src_data, bytes, cudaMemcpyDeviceToHost, static_cast<cudaStream_t>(stream.GetHandle()));
1: Google Test trace:
1: D:\a_work\1\s\onnxruntime\test\providers\base_tester.cc(791): registered execution providers: CUDAExecutionProvider
1: Stack trace:

@tianleiwu
Contributor

The test is done in A10 GPU with CUDA 11.8 and cuDNN 8.5.0.96 (According to https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements).
@snnn, could you confirm the cuDNN version and driver version?

@tianleiwu
Contributor

/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline

@tianleiwu
Contributor

/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-python-checks-ci-pipeline,onnxruntime-binary-size-checks-ci-pipeline,Big Models

@tianleiwu
Contributor

/azp run Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline

Azure Pipelines successfully started running 3 pipeline(s).

Azure Pipelines successfully started running 10 pipeline(s).

Azure Pipelines successfully started running 9 pipeline(s).

@mtavenrath
Contributor Author

I was able to reproduce the failure by downgrading to cuDNN 8.5 for CUDA 11.8. Starting with cuDNN 8.9.1 the sequence-length pointer is no longer required, and the one passed here was incorrect in general. I guess most uses of cudnnRNNForward no longer read the sequence-length buffer, except for the single case hit by the failing test.

@tianleiwu
Contributor

/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline

@tianleiwu
Contributor

/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-python-checks-ci-pipeline,onnxruntime-binary-size-checks-ci-pipeline,Android CI Pipeline

@tianleiwu
Contributor

/azp run iOS CI Pipeline,ONNX Runtime React Native CI Pipeline

Azure Pipelines successfully started running 2 pipeline(s).

Azure Pipelines successfully started running 9 pipeline(s).

Azure Pipelines successfully started running 10 pipeline(s).

@tianleiwu
Contributor

/azp run Big Models

Azure Pipelines successfully started running 1 pipeline(s).

@mtavenrath
Contributor Author

All CIs except for iOS succeeded. The iOS failure is unrelated to this PR.

@tianleiwu tianleiwu merged commit efbe2b8 into microsoft:main Feb 23, 2024
80 of 81 checks passed
YUNQIUGUO pushed a commit that referenced this pull request Feb 27, 2024
…NN v8 RNN API and reenable RNN tests for CUDA EP (#19419)

Replace the deprecated cuDNN RNN API with the cuDNN v8 RNN API and
re-enable the RNN tests for the CUDA EP.

### Motivation and Context
The deprecated cuDNN RNN API might vanish soon, and all RNN tests for
the current CUDA EP RNN implementation are disabled due to failures.
With this change the deprecated API has been removed, and the updated
implementation no longer fails the tests.